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Abstract 

Motivation: 



Algorithms for differential analysis of microarray data are vital to mod- 
ern biomedical research. Their accuracy strongly depends on effective 
^ treatment of inter-gene correlation. Correlation is ordinarily accounted 

for in terms of its effect on significance cut-offs. In this paper it is shown 
I— ' that correlation can, in fact, be exploited to share information across 

, tests, which, in turn, can increase statistical power. 

> 

5^ Results: 

fSJ Vastly and demonstrably improved differential analysis approaches are 

^s.^ the result of combining identifiability (the fact that in most microarray 

data sets, a large proportion of genes can be identified a priori as non- 
00 differential) with optimization criteria that incorporate correlation. As 

a special case, we develop a method which builds upon the widely 
^ used two-sample i-statistic based approach and uses the Mahalanobis 

distance as an optimality criterion. Results on the prostate cancer data 



of Singh et al. ( 2002 1 suggest that the proposed method outperforms 



all published approaches in terms of statistical power. 



Availability: 

The proposed algorithm is implemented in MATLAB and in R. The 
software, called Tellipsoid, and relevant data sets are available at |www. | 
egr . msu . edu/'^desaikey 



*to whom correspondence should be addressed; contact: desaikey@egr.msu.edu 
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1 Introduction 



Detecting differentially expressed genes in the presence of substantial inter- 
gene correlation is a challenging problem. Research has focused largely on 
understanding the harmful effects of correlation on the threshold settings 
demarcating null and non-null genes. In fact, however, the nominally con- 
founding correlation can be exploited to increase statistical power of microar- 
ray studies. This observation has engendered a fruitful research direction 
introduced in this paper. 

The literature is not devoid of attempts to develop more powerful sum- 
mary statistics which exploit correlation, but such efforts have not produced 
compelling results. We posit that the limitations of such developments are 
due, at least in part, to neglecting identifiability - the fact that in most 
microarray data sets, a large proportion of genes requires no technical effort 
to be identified as explicitly non-differential. 

In this paper, we present a new differential analysis method which ex- 
ploits identifiability through an optimization criterion which incorporates 
inter-gene correlation. The method builds upon the widely-used two-sample 
t-statistic approach with a minimization of the Mahalanobis distance as the 
optimality criterion. Although this paper focuses primarily on two-sample 
microarray studies, the framework is readily generalizable for use in studies 
with multiple or continuous covariates, as well as to other multiple compar- 
ison applications. 

Let us assume that m genes are measured on n microarrays, under two 
different experimental conditions, such as control and treatment. Based 
on the gene expression matrix X, we are interested in identifying a small 
number of genes that are responsible for class distinction. One widely used 
strategy [e.g., Tusher et al. (20011 and Efron (2004])] is to represent each 
gene by its null hypothesis and two-sample t-statisticQsay 



Null hypotheses: 
Test statistics: 



Hi, H2, . . . , Hm 
tl , t2 ) • • • ) tm,' 



The magnitudes of the t-statistics establish a gene-ranking and the R 
(presumably <C m) genes with the largest t-scores are reported as statisti- 
cally significant discoveries. The investigator can either explicitly supply a 
R or rely on the false discovery rate (FDR) calculations to find a maximal 
R with the allowable FDR. 

The issue of correctly estimating the FDR in the presence of correlation 
has received much recent attention because highly correlated tests increase 
the variance of the FDR leading to unreliable results (Owen 20051. As 

r "over powered" X's, 



Efron 


(2007 


1 and 


Desai et al. 


(2007 



there may be significantly fewer tail-area null counts than expected, while for 



^nuU = non-differential; non-null = differential 
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"under powered" X's, the situation can worsen with many more tail-area 
null counts than expected. Importantly though, techniques for correctly 
estimating the FDR do not change the gene-ranking, but only the size of 
the reported list. 

The present research was motivated by the notion that, for "under pow- 
ered" X's, it might be possible to exploit correlation across tj's to establish 
a gene ranking that has better statistical power than the raw tj based rank- 
ing. The method that resulted from an exploration of this question indeed 
seems to improve the statistical power of all X's. The proposed method 
uses, (i) a vector of i-statistics (t = [ti, t2, • • • , im]"^) and (ii) an estimate of 
the covariance matrix of the vector t, to output a substantially revised ver- 
sion of t, denoted u|t, whose corresponding entries can be used to establish 
an improved gene-ranking. 

The method is summarized as follows. Lel|^t ~ (u, E). Now based on 
the measured value of t, we can estimate u|t, but while doing so we invoke 
the zero assumption (ZA) (Efron 20061 that the smallest P{%) of ij's are 
associated with null genes. Based on the ZA, we can set corresponding 
entries of u to zero. For the remaining entries of u we obtain minimum 
Mahalanobis distance (Mahalanobis 19361 estimates. 

Inter-gene correlation causes t to distribute around its center of mass 
in an hyperellipsoidal manner, and the Mahalanobis distance is a natural 
way to measure vector distances in such a distribution. In fact, the name 
Tellipsoid is adopted to emphasize the geometric intuition of tracking the 
center of an ellipsoid. We have done extensive experimentation with both 
real and simulated data and found that for a truly null ti which happens to be 
in tail-area, the corresponding Ui\t consistently tends to zero (its theoretical 
value) . 

Two prior research efforts were particularly useful in formulating the 
present approach. Storey et al. (20071, present a more general approach 
to the notion of sharing information across tj's, which they describe as 
"borrowing strength across the tests," for a potential increase in statistical 
power. Tibshirani and Wasserman (20061, discuss a new statistic called the 
"correlation-shared" i-statistic, and derive theoretical bounds on its perfor- 
mance; however, their approach requires strong assumptions regarding the 
nature of correlation between null and non-null genes which may not hold 
in many real data sets. The proposed method requires no such assumptions 
and yet provides considerable increase in power as demonstrated by results 
reported in Section |4| 

The outline of this paper is as follows. Section [2] defines, and obtains 
closed form expressions for, the minimum Mahalanobis distance estimates in 
the i-statistic space. Section [3] builds on the theory of Section [2] to develop 
the Tellipsoid gene-ranking method. In Section |4] we apply Tellipsoid to 

^No distributional information nor higher order statistics of t are assumed. 
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real cancer data of Singh et al. ( |2002[ ) and compare its accuracy with the 



state-of-the-art differential analysis tools EDGE (Storey et al. , 2007) and 



SAM (Tusher et al. 2001 ). Section |4] also discusses the implications of these 
results. We conclude with Section |5] summarizing the main ideas. 



2 Approach 

2.1 Per gene summary statistic 

Let X be the mx n matrix of gene expressions, for m genes and n samples. 
We assume that the samples fall into two groups k = 1 and k = 2 and there 
are samples in group k with ni + n2 = n. We start with the standard 
(unpaired) t-statistic: 

Xr,2 - Xi-i 



(1) 



where Xi-k is the mean of gene i in group k and Si is the pooled within-group 



standard deviation of gene i. If the i gene is indeed null, then we expect 
ti ~ (0,z//(i^ ~ 2)), where i' is the number of degrees of freedom, obtained 
from either the unpaired t-test theory or the permutation null calculations 
Efron (2007 1 . If gene i is non-null, then we expect ti ~ {ui,af). For 



as m 



non-null genes, the values of Ui and cxj depend on the amount of up / down 
regulation, the number of samples in each group, and ly. 



2.2 The zero assumption 

Without loss of generality, we may assume that the genes are indexed so 
that 

|il| < 1*21 < ••• < |tm|. (2) 

Then, a reasonable way to impose identifiability on null genes is through 
the ZA, namely, that P{%) of the genes - those with the smallest t statis- 
tics - are null. Efron (2006) discusses the use of the ZA in a variety of 
differential analysis approaches. It plays a central role in the literature on 



estimating the proportion of null genes, as in Pawitan et al. (2005) and 



Langaas et al. (2005). The ZA is equally crucial for the two-group model 
approach developed in the Bayesian microarray literature, as in 



Lee et al. 



(20001, Newton et al. (2001), and Efron et al. (2001). Furthermore, the 



claim that the method in Storey ( 2002 ) improves upon the well-known Ben 



jamini and Hochberg (1995) FDR procedure (in terms of statistical power) 



also crucially relies on an adaptive version of the ZA. The use of the ZA is 
justified in the present situation as long as P is sufficiently small so that the 
bottom P{%) genes would almost certainly be declared null for reasonable 
FDR's. 
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Formally, the ZA is stated as follows: Let c be the largest integer (gene 
index) such that c/m < P/100, denoted 

c=r0.01mP], (3) 

then genes with indices 1,2, ... ,c are assumed null. Let us partition the set 
of t statistics into those corresponding to genes declared null under the ZA, 
{ti ,t2, . . ■ , tc} , and those for the remaining m — c genes which continue to 
compete for the non-null designation, {tc+i,ic+2, • • • ,tm}- For convenience, 
we introduce the following vector notation, 

* = [ ^l-c *c+l:m ] = 

Then the random vector t is distributed in the following way: 

t~(u,S), (4) 

where u is the underlying mean vector and S the covariance matrix. The 
corresponding partitions of (u, S) are denoted 

u=M and S=f^(oo) ^^'A. 

V"(i)/ V^(io) ^(11)/ 

The central hypothesis of this paper is that there is a vector, say u|t, whose 
elements represent a reordering of the elements of the t, such that gene- 
ranking represented by u|t has better statistical power for detecting non-null 
genes than that based on t itself. 

(In the present paper, we focus on mainly the second-moment distri- 
butional characteristics of t. However, in fact, if the gene expressions are 
normally distributed, then, perhaps, t is described more accurately by the 
multivariate Student distribution. Exploitation of this additional structure 
will be considered in future work.) 

2.3 Estimating u|t 

We are interested in obtaining an estimate of the vector mean u based on the 
observation t. This requires an appropriate metric in the space of t vectors, 
with which to quantify the distance of the observed t from the center of 
mass u, say dist(t,u). The £2 norm induces a useful metric between t and 
u provided that we first decorrelate the vector elements as S"^/^ (t — u), 
thus yielding 

dist(t,u) = ^||I]-V2(t-u) ||2 

= y/(t-u)^S-i(t-u). (5) 



\0) ^(1) 
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This weighted Euchdean distance is sometimes called the Mahalanobis dis- 



tance in the pattern classification literature as in Deller Jr et al. (1993), 



Devijver and Kittler (19821, etc. 

We can relate t to u through the Mahalanobis distance but while doing 
so we invoke the ZA, which, in turn, implies that the first c entries of u are 
zero. This yields the estimate 



u 





(1) 



u 



where u 



(1) 



argmin 



(6) 



t(0) 

'(1) - 



- 

U(l) 



^(00) 
^(10) 



^(01) 

^(11) 



-1 



t(0) 

'(1) - 



-0 

U(l) 



In effect, u^^^^ combines the identifiability information based on the ZA with 
the information about the covariance structure of t which too can be ob- 
tained from the measured X itself. Notably the optimization in Eqn. (|6| 
enjoys closed form solution: 



u 



(1) 



t(l) -^(10)5^(oo)*(0)- 



(7) 



The derivation leading from Eqn. ^ to Eqn. ([7| is provided in the Appendix. 
2.4 Estimating Cov(t,;,ti/) 

To estimate the required entries of XI, we make several observations. The va- 
lidity of these observations can be established through computer simulations 
using the MATLAB script testtcov.m available with the Tellipsoid software. 
(Equation ([T]) does not require the covariance between two non-null t^'s.) 
Observation 1. If genes i and i' both are null, then 



Cov(ti,tj/) CoT{Xi,Xi') 



(8) 



Efron 


(2007 


) and 


Owen (2005 



calculations. This observation maybe intuitive to the reader from Eqn. (jlj) 
itself, or it is easily verified through a computer simulation. 

Observation 2. Similarly, if the gene i is null and i' non-null (or con- 
versely), then 



C0Y{ti,ti 



n2 Cor(xi;i, + ni Cor(xt;2, Xi/;2) 
ni -I- n2 



(9) 



where Cor(xj;fc, denotes the correlation between gene i and i' within 
group k. Equation ^ accommodates the possibility that the correlation 
between a null and a non-null gene may change across groups. If this does 
not occur, then Eqn. (l9|) reduces to Eqn. (Isj). 
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Observation 3. Furthermore, if ni ~ n2 (which is the case for most mi- 
croarray studies), then Eqn. Q simphfies to: 

u ^ \ Cor(xi;i,3;i/.i) + Cor(xi;2,a;i/;2) v 
Cow{ti,ti>) K — ^. (10) 



Equations ([8| and (10) suggest that we may use sample correlation to 
estimate Cov(tj,tj/): 

Cov{ti,ti') (X I ' ^, (11) 



where Xij denotes the expression level of the i^^ gene measured on the j^^ 
microarray after subtracting the average response within each treatment 
group. The scale factors cancel in the terms '^(lo) and ^^g^, so that esti- 
mating v/{v-2) [see Eqn. ([s])] is not required. 

2.5 Tellipsoid equation 

In light of Eqn. ( |11[ ), Eqn. ([T]) takes the practical form 

"(1) = ~ ^(io)^(oo)t(o), (12) 

where C is the sample correlation matrix of the gene expression matrix 
X (after removing the treatment effects). In most cases computing the 
full matrix inverse (C^q^ is not necessary and solving the term C^pjt(o) 
through some efficient linear solver reduces the computation considerably. 



(See Subsection 3.3 for detail). 



3 Algorithm 
3.1 Tellipsoid 

This section outlines a self-contained differential analysis algorithm based on 
the ideas discussed in Section [2} Its name Tellipsoid was coined to emphasize 
the geometric intuition of tracking the center of an ellipsoid. 

Tellipsoid takes gene expression matrix X and provides a specified num- 
ber, R, of the most differentially expressed genes. In principle, the ranking 
is based on the set {«*} from Eqn. (|6]). In practice, we rely on the estimates 



{u*} from Eqn. (12) 



The algorithm begins by reindexing the genes based on their two-sample 
t-statistics [Eqn. ([2])]. Then, based on the ZA, the first c genes are identified 
as null, as specified in Eqn. (|3|. By default, P is set to 50(%). Although the 
choice 50% is somewhat arbitrary, this fraction has worked well empirically 
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in the data sets tested. Future research may yield more rigorous methods 
for choosing P. 

In order to nulhfy any genuine treatment differences, X is converted 
to X by subtracting each genes average response within each treatment 
group. The sample correlation matrix C of X is computed subsequently. 
The crucial step is to compute u*-|^^ based on Eqn. (12). The elements of 

u* = [0^x1 (2(1))"^]"^ determine the gene-ranking: A gene with bigger is 
assigned a higher rank. The first R genes are reported as top R statistical 
discoveries. 



3.2 Numerical stability 

Because the number of samples n is often less than the number of genes 
m, the sample correlation matrix turns out to be singular and hence non- 
invertible. Therefore, we add a very small correction term (= 10"^") to its 
diagonal entries to make it nonsingular, and in effect, invertible. After this 
correction, Tellipsoid shows excellent numerical stability. 



3.3 Computational complexity 



Equation (12) involves matrix inversion, which, if performed in a naive way, 
could be a prohibitive operation, since microarray data sets may have several 
tens of thousand genes. Indeed, solving the term C^g^t(o) as a system of 

simultaneous linear equations (C(oo)^ = 't(o)) is much faster than explicitly 
computing C^^^. In particular, we can employ the Cholesky decomposi- 
tion to exploit the fact that the matrix C(oo) is symmetric and positive 
definite. MATLAB implementation of Tellipsoid uses the in-built linslove 
with appropriate settings, which, in turn, uses highly optimized routines of 
LAPACK (Linear Algebra PACKage— http : //www . netlib . org /lapack/^ . 

For the Prostate data (used in Section |4]) with 12625 genes and 102 
samples, Tellipsoid, running on a computer with a 2.2 GHz dual-core AMD 
Opteron processor with 8 GB of RAM and MATLAB version R2006b, re- 
quires just under 40 seconds to report the final gene list. For the same 
settings, the implementation with explicit matrix inversions takes ~ 10 min- 
utes. 



3.4 Summary of Tellipsoid 



Tellipsoid: An improved differential analysis algorithm 

Input: X = Labeled m x n gene expression matrix; R = Size of gene list 

Output: The gene list containing top R differentially expressed genes 

1. Calculate two-sample (unpaired) t-statistics: ti = {xi-2 — Xi-i)/si 
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2. Reindex genes such that < |t2| < • • • < \tm\ 

3. Gather first c = [O.OlmP] tj's in a vector t(o); By default P=50(%) 

4. Convert X to X by subtracting each genes average response within 
each treatment group 

5. Compute C = the sample correlation matrix of X 

6. Find G* = [Ojxi ^^"^^"^ %) = " ^(lO)^^oV(o) 

7. Determine gene-ranking: Assign a gene with bigger a higher rank 

8. Report top R genes as statistical discoveries 



4 Results and Discussion 

We compare Tellipsoid with two of the leading techniques, SAM [Significance 



Analysis of Microarrays (Tusher et al. 2001 1] and EDGE [Extraction and 
analysis of Differential Gene Expression (Leek et al. , 20061]. SAM adds a 



small exchangeability factor sq to the pooled sample variance when comput- 
ing the two-sample t-statistic: di = {xi-2 — x^i)/{si + sq); whereas EDGE 
is based on a general framework for sharing information across tests (see 



Storey et al. (2007|). EDGE is reported to show substantial improvement 



(in terms of statistical power) over five of the leading techniques including 



riori probability approach of Lonnstedt and Speed ( 2002 1 . It is noteworthy 



SAM dStorey et al| |2007 ). The other four are: (i) t/F- test of|Kerr et al. 
( |2000[ ) and pudoit et al.| ( f2002D ; (ii) S hrunken t/F-test o f [Cui et al.| ( |2005^ 
(iii) The empirical Bayes local FDR of Efron et al. (2001 1; (iv) The a poste- 



that Tellipsoid shows a major improvement over EDGE itself. 

To determine the performance quality of various techniques, we focus 
primarily on the empirical FDR in the reported gene list: Empirical FDR = 
NoFP/i?, where NoFP = the number of false positives. Broadly speaking, 
smaller the FDR better the technique. 



4.1 Prostate data 



(Singh et al. 20021 studied m = 12625 genes on n = 102 oligonucleotide 
microarrays, comparing rii = 50 healthy males with n2 = 52 prostate cancer 
patients. The purpose of their study was to identify genes that might an- 
ticipate the clinical behavior of Prostate cancer. We downloaded the .GEL 
files from [http: //www-genome .wi .mit . edu/MPR/prostate[ The software 
RMAExpress (Irizarry et al. 20031 was used to obtain high quality gene 



expressions from these .GEL files. We let RMAExpress apply its in-built 
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background adjustment, however, the quantile normahzation was skipped. 
Each gene was represented in the final expression matrix X by the loga- 
rithm (base 10) of its expression level. Taking the log is thought to increase 
normality and stabilize across group standard deviations (Tsai et al. , 2003). 



4.1.1 Data with known results 

Algorithm testing required an expression matrix X with the knowledge of 
truly non-differential genes. At the same time, we wanted the inter-gene 
correlation in X to resemble that in the real microarray data. These two 
seemingly conflicting requirements were satisfied concurrently by row stan- 
dardizing a real X. The prostate cancer matrix X was transformed to X_ 
by subtracting each gene's average response within each treatment group, 
and by normalizing within group sample mean squares. That is, for each 

group k E {1,2}, (l/nfc)Ejlij = and (iK) ij,- = 1. Here, the 
sum runs over corresponding samples only. With this transformation, all 
genes have equal energy and yet the same within group inter-gene correlation 
structure as the original 

To generate a test data set from X, its 102 columns were randomly di- 
vided into groups of 50 [=ni) and 52 (=^2). Next m„ (m^i) genes were 
randomly chosen for up (down) regulation by adding a positive (negative) 
offset Xu {xd) to the corresponding entries in group 2. Various choices of 
{niu, md, Xu, Xd) were tested to represent a range of differential analysis sce- 
narios encountered in practice. 



4.1.2 Results for prostate data 

Two cases were studied. In the first, the proportion of truly differential 
genes, say pi, was taken to be relatively small: pi ~ 0.01-0.05. The second 
case employed a larger pi ~ 0.1. The former simulates microarray studies 
seeking genes that distinguish subtypes of cancer, diabetes, etc., whereas 
the latter resembles studies comparing healthy versus diseased cell activity. 

Results were obtained using the subroutines samr.r from the package 
"samr" and statex . r from the package "edge." Both routines compute their 
native gene summary statistics given the matrix X and corresponding col- 
umn labels. These statistics, in turn, can be used to determine top R genes. 
Case 1 [pi 0.025, m„=200, md=100, Xn=0.1, and Xd=-0.1]. Figure |1 (a) | 



shows plots of the FDRs for 40 different data sets with the size of the re- 
ported list, i?=300. A large value of R coincides with an attempt to extract 
as many differential expressions as possible, a desired goal especially in mi- 
croarray studies performed to identify genes that are to be explored further 



^Note. Normalizing within group sample mean squares to unity is not implemented 
in the Tellipsoid algorithm. 
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- experimentally or computationally - to gain better understanding of un- 
derlying gene networks. Since the differential signal Xm=0.1 and Xrf=-0.1 is 
rather weak, recovering a good list is not easy as evident from the results - 
among all methods only Tellipsoid achieved sufficiently low FDRs to rescue 
a few X's. 

1| ' ' ' 1 1i ' 1 ' 1 



□ □ 
Q0.5 ° \ 



□ □ 



□ 

□ □ 

□ [] 



10 20 30 
Data index 

(a) 



40 




10 20 30 
Data index 

(b) 



40 



Figure 1: FDRs for Case 1. The number of truly differential gene is 300. 
Panel (a) i?=300; Panel (b) ii=100. TeUipsoid offers better detection. 
Square (□) marker = Tellipsoid. Lines: solid = raw i-statistic; dotted = 
SAM; dashed = EDGE. 



Figure |l(b) presents results for i?=100. A smaller R would be cho- 
sen to identify high-quality class distinguishing features for gene-expression- 
profiling-based clinical diagnosis and prognosis, where the goal is to build 



accurate classifiers and predictors. Whereas Singh et al. (20021 build a clas- 



sifier around only 16 of 12625 features, they do discuss the need to include as 
many reliable features as possible. Remarkably, for 37 out of 40 X matrices, 
Tellipsoid reports gene lists with no false discoveries at all, while the other 
techniques fail to obtain a single gene list with an FDR<0.5. 

Case 2a [pi « 0.1, mu=600, 771^=600, x„=0.02, and Xrf=-0.02]. In this set 
of experiments, pi is increased, but the differential signal is reduced. This 
situation also proves to be challenging for the existing techniques. However, 
TeUipsoid provides the FDR of ~ 0.5 for i2=1200, and, again for i?=300, 
while it reports most gene lists with no false discoveries at all. 

Case 2b [pi ?s0.1, 771^=600, 771^=600, Xu=0.1, and Xd=-0.1]. This subcase 
is designed to assess the effects of small sample sizes on performance, ni and 
772 are both reduced to 20. We randomly chose 20 columns per group from 
the original prostate cancer X, and then applied the data generation process 



(including row standardization) detailed in Subsection 4.1.1 Reduction in 
the number of samples is compensated by increase in the differential signal. 
The FDRs for Tellipsoid, Fig. [3j are excellent suggesting that Tellipsoid 
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10 20 30 40 10 20 30 40 
Data index Data index 

(a) (b) 

Figure 2: FDRs for Case 2a. The number of truly differential gene is 1200. 
Panel (a) i?=1200; Panel (b) i?=300. Tellipsoid performs much better. 
Square (□) marker = Tellipsoid. Lines: solid = raw t-statistic; dotted = 
SAM; dashed = EDGE. 



increases power of small sample data sets too. 



4.2 Simulated data 



Before devising the test data setup of Subsection 4.1.1[ Tellipsoid was tested 
on several simulated data sets. Below we discuss some simulation results that 
shed further light on the small sample behavior. 

Let us denote by the i^^ column of a simulated expression matrix 
X. We assume that the random vector is multivariate Gaussian with 
mean and covariance matrix W. Each such column represents m=3226 
genes with a covariance matrix W that introduces roughly the same amount 



of correlation as found in the BRCA data of Hedenfalk et al. (2001 1 . We 
choose niu = 50, = 50, = l,Xrf = — l,ni = 10, and n2 = 10. Figure |4] 
shows plots of the FDRs for i?=50 and i?=100. Table [T] shows results for 
some X realization from Fig. |4(b)[ Shown are the top 100 values of ii* 
and each corresponding original ti with concomitant rank. With smaller n, 
preeminence of Tellipsoid with respect to existing techniques scales down a 
bit. Nevertheless, for i?=50 case, for 25 out of 40 simulated X realizations, 
Tellipsoid achieves a low FDR of ~ 0.1 or less. 

Interestingly, with a smaller n, SAM outperforms the other two tech- 
niques. This is not entirely surprising as a smaller n can make the noise in 
the per gene pooled variance Si (and possibly the equivalent quantity in the 
EDGE algorithm) more prominent. Nevertheless, SAM does mitigate this 
issue in some measure by using the exchangeability factor sq to adjust the 



effective pooled variance (Tusher et al. 2001). 
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Table 1: Tellipsoid in action with Top 100 u*'s. Corresponding tj's and their 
rank are also shown. The results are for some X realization from Fig. |4(b)[ 
Tellipsoid = 22 NoFPs; raw t-statistics = 68 NoFPs. Truly null genes are 
printed in bold-sans typeface. 
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Figure 3: FDRs for Case 2b. Panel (a) i?=1200; Panel (b) i?=300. The 
sample size is smaller than that in Cases 1 & 2a and yet Tellipsoid performs 
well. Square (□) marker = Tellipsoid. Lines: solid = raw t-statistic; dotted 
= SAM; dashed = EDGE. 



4.3 Discussion 



By allowing researchers to examine the simultaneous expressions of enor- 
mous numbers of genes, microarrays promised to revolutionize the under- 
standing of complex diseases and usher in an era of personalized medicine. 
However, the shift in perception of that promise is palpable in the literature. 



A 1999 Nature Genetics article (Lander, 19991 is entitled "Array of hope," 



but a 2005 Nature Reviews article (Frantz 20051 is entitled "An array of 



problems." It is not unusual for impacts of new technologies to be overes- 
timated when first deployed, then to have the expectations moderated as 
the technologies reveal new complexities in the problems they are designed 
to solve. In the study of microarray data, the need for exceeding care in 
the design and regularization of experiments and data collection are un- 
derstood to be critical, but the biggest hindrance to progress has been the 



data interpretation. In particular, as Efron (20071 and Owen (20051 point 



out, the biggest challenge seems to be the treatment of intrinsic inter-gene 
correlation. 

In most microarray data there are at least three vital resources: (i) 
identifiability (ii) immense parallel structure, and (iii) inter-gene correlation 



itself. Thoughtful analyses in the papers by Efron (Efron 2005 2000) have 



suggested this view of the rich information structure inherent in the data. 
In this light, Tellipsoid can be viewed as exploiting more than correlation as 
a means of sharing information across tests, as it also involves identifiability. 

A crucial step in formulating Tellipsoid was the comprehension of the 
effects of inter-gene correlation on Cov(ti,tj/). In light of Observations 1-3, 
the choice of the Mahalanobis distance was intuitive, as it is already known 
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40 



Figure 4: FDRs for simulated data. Panel (a) i?=100; Panel (b) i?=50. 
(Small) sample sizes: ni=10, n2=10. Yet, Tellipsoid performs better than 
the rest. Square (□) marker = Tellipsoid. Lines: solid = raw t-statistic; 
dotted = SAM; dashed = EDGE. 



to give computationally attractive solutions through the matrix inversion 
lemma. 

Limited time and resources - and perhaps also the necessity for scientific 
focus - often require biomedical researchers to work on only a small number 
of "hot (gene) prospects." Even under such highly conservative conditions, 
however, misleading results can occur, as evident in the results of Figs.[TJ^ 
For all their careful development and statistical power, even state-of-the- 
art tools like EDGE and SAM can report spurious gene lists. The extra 
statistical power of Tellipsoid promises to further guard against anomalous 
results that can have serious consequences for the trajectory of a study of 
gene function, causation, and interaction. 



5 Conclusion 

This paper has reported the development and testing of a novel framework 
for the detection of differential gene expression. The framework combines 
the exploitation of inter-gene correlation to share information across tests, 
with identifiability - the fact that in most microarray data sets, a large pro- 
portion of genes can be identified a priori as non-differential. When applied 
to the widely used two-sample f-statistic approach, this viewpoint yielded 
an elegant differential analysis technique, Tellipsoid, which requires as in- 
puts only a gene expression matrix, related two-sample labels, and the size 
of desired gene-list R. Tellipsoid was tested on the prostate cancer data of 



Singh et al. (20021 and some simulated data. Compared to SAM (Tusher 



et al. 2001 1, EDGE (Storey et al. 2007), and the raw t-statistic approach it- 
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self, Tellipsoid shows substantial improvement in statistical power. Usually, 
with increase in microarray samples, power tends to increase considerably, 
but, even for small sample sizes, Tellipsoid 's performance improvement is 
noticeable. The software (coded in MATLAB and in R) and test data sets 



are available at [www . egr . msu . edu/~desaikey 
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APPENDIX 

Suppose that 



d) and =t(i) -U(i). 

We also have C = B^. Substituting these in Eqn. [g] yields: 

U(i) = argmin t(^)At(o) +2u('i)Ct(o) +U('i)Du(i). 



In Eqn. A-2 by setting the gradient w.r.t U(i) to 0, we obtain: 



u 



(1) 



(A-1) 



(A-2) 



(A-3) 



Now for 5] ^, we can appeal to the matrix inversion lemma (Golub and 



Van Loan 1996): 



-,-1 



(00) + ^(01)^ ^^(10)^(00)) ~^(00)^(01)Q ^ 

-Q""^^(io)^(oo) Q""^ . 



where Q = "^{ii) — ^(io)^(oo)^(oi)- Plugging this in Eqn. A-3 yields: 



u 



(1) 



^(10)^(00)^(0)- 



(A-4) 



Combining Eqn. A-4 with Eqn. |A-1 provides the desired expression: 



u 



(1) = ^1) 



'(10)^(00)^(0) 
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